Some Baseball Data
We have two data files, Batting.csv (95195 lines of text) and Master.csv (17916 lines of text).
Batting.csv (partial)
Master.csv (partial)
Analysis Goals
- Find the player with the highest run for each year.
- What are the first and last names of the player who had the highest run in each year? Display the name, player ID, year, and the highest run.
- Who (first name and last name) had the highest run of all the years in the dataset, and in what year?
- What is the average run (rounded to two decimal places) of each year?
- What is the average run (rounded to two decimal places) of all the years in the dataset?
Load the Data and Create the Relations
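A minimal sketch of this step, assuming Batting.csv follows the Lahman column order (player ID in column 0, year in column 1, runs in column 8) and sits in the working directory. The alias A1 and the field names are illustrative, and the field is called runs rather than run because run is also the name of a Grunt shell command. Check the positions against the header row of your copy.

```pig
-- Load every CSV row as a tuple of untyped fields.
A  = LOAD 'Batting.csv' USING PigStorage(',');
-- Assumed positions: $0 = player ID, $1 = year, $8 = runs.
A1 = FOREACH A GENERATE (chararray)$0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
-- The header row fails the int cast and yields a null year, so this filter drops it.
B  = FILTER A1 BY year IS NOT NULL;
-- Peek at the first rows of B.
B_head = LIMIT B 10;
DUMP B_head;
```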
The Beginning Portion of Relation B
This Achieves the Same Result as in the Hive Case Study
Another Way to Create the Relations
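One plausible alternative is to declare a schema in the LOAD statement instead of projecting by position. The sketch below assumes the first nine columns of Batting.csv are laid out as in the Lahman data set; every column name here is an illustrative assumption, so adjust the list to match your header row.

```pig
-- Declare names and types for the first nine columns only;
-- the trailing columns are not needed for this analysis.
A  = LOAD 'Batting.csv' USING PigStorage(',')
     AS (player_id:chararray, year:int, stint:int, team_id:chararray, lg_id:chararray,
         g:int, g_batting:int, ab:int, runs:int);
-- Keep the three fields used later.
A1 = FOREACH A GENERATE player_id, year, runs;
-- The header row cannot be read as an int year, so it arrives as null and is dropped.
B  = FILTER A1 BY year IS NOT NULL;
B_head = LIMIT B 10;
DUMP B_head;
```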
The Beginning Portion of Relation B
Find the Highest Run for Each Year
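A sketch of the grouping step, under the same assumptions as before (column positions $0, $1, $8 and the field names are not confirmed by the source). Relation C groups the rows by year and relation D keeps the maximum runs per group.

```pig
-- Assumed positions: $0 = player ID, $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (chararray)$0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
-- One group per year, each holding a bag of that year's rows.
C  = GROUP B BY year;
-- MAX skips null values, so rows with a missing runs field do not matter.
D  = FOREACH C GENERATE group AS year, MAX(B.runs) AS max_runs;
D_sorted = ORDER D BY year;
DUMP D_sorted;
```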
The Beginning Portion of Relation D, MAX(run) by Year
Find the Highest Run for Each Year with Player ID
Find the Highest Run for Each Year with Player ID (Beginning)
Find the Highest Run for Each Year with Player ID (End)
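To attach a player ID to each yearly maximum, one option is to join the per-year maxima back to the batting rows on both year and runs. This is a sketch under the same column-position assumptions; the alias names joined and joined_sorted are illustrative. Players who tie for the maximum in a year each produce a row.

```pig
-- Assumed positions: $0 = player ID, $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (chararray)$0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
C  = GROUP B BY year;
D  = FOREACH C GENERATE group AS year, MAX(B.runs) AS max_runs;
-- Keep only batting rows whose (year, runs) pair equals that year's maximum.
joined = JOIN D BY (year, max_runs), B BY (year, runs);
joined_sorted = ORDER joined BY D::year;
-- Each row carries five fields: D::year, D::max_runs, B::player_id, B::year, B::runs.
DUMP joined_sorted;
```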
Just Show the Three Columns: Player_id, Year and MaxRun
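The join output repeats the year and runs columns, so a final projection can trim it to three fields. A sketch, with the same assumptions about column positions and with illustrative alias names:

```pig
-- Assumed positions: $0 = player ID, $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (chararray)$0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
C  = GROUP B BY year;
D  = FOREACH C GENERATE group AS year, MAX(B.runs) AS max_runs;
joined = JOIN D BY (year, max_runs), B BY (year, runs);
-- Disambiguate fields with the relation prefix and keep one copy of each.
top_by_year = FOREACH joined GENERATE B::player_id AS player_id,
                                      D::year      AS year,
                                      D::max_runs  AS max_runs;
top_sorted = ORDER top_by_year BY year;
DUMP top_sorted;
```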
Load Data from Master.csv
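A sketch of loading the name columns, assuming Master.csv holds the player ID in column 1, the first name in column 16, and the last name in column 17. These positions are an assumption that varies between Lahman releases, so verify them against the header row. PigStorage splits on every comma, so quoted fields with embedded commas would shift the positions.

```pig
-- Load every row of the master file as untyped fields.
M = LOAD 'Master.csv' USING PigStorage(',');
-- Assumed positions: $1 = player ID, $16 = first name, $17 = last name.
names = FOREACH M GENERATE (chararray)$1  AS player_id,
                           (chararray)$16 AS first_name,
                           (chararray)$17 AS last_name;
-- Confirm that the relation was created and inspect a few rows.
DESCRIBE names;
names_head = LIMIT names 10;
DUMP names_head;
```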
A Relation Was Created Successfully
The 2nd Goal
- What are the first and last names of the player who had the highest run in each year? Display the name, player ID, year, and the highest run.
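One way to answer this is to join the yearly leaders with the name columns on player ID. The sketch below is self-contained and assumes the column positions noted in the comments for both files; none of them are confirmed by the source.

```pig
-- Batting.csv, assumed positions: $0 = player ID, $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (chararray)$0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
C  = GROUP B BY year;
D  = FOREACH C GENERATE group AS year, MAX(B.runs) AS max_runs;
joined = JOIN D BY (year, max_runs), B BY (year, runs);
top_by_year = FOREACH joined GENERATE B::player_id AS player_id,
                                      D::year AS year, D::max_runs AS max_runs;
-- Master.csv, assumed positions: $1 = player ID, $16 = first name, $17 = last name.
M = LOAD 'Master.csv' USING PigStorage(',');
names = FOREACH M GENERATE (chararray)$1 AS player_id,
                           (chararray)$16 AS first_name, (chararray)$17 AS last_name;
-- Attach names to each yearly leader.
named = JOIN top_by_year BY player_id, names BY player_id;
answer = FOREACH named GENERATE names::first_name AS first_name,
                                names::last_name  AS last_name,
                                top_by_year::player_id AS player_id,
                                top_by_year::year      AS year,
                                top_by_year::max_runs  AS max_runs;
answer_sorted = ORDER answer BY year;
DUMP answer_sorted;
```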
The 3rd Goal
- Who (first name and last name) had the highest run of all the years in the dataset, and in what year?
The Pig Script to Find the All-Year Max Run
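A sketch of the all-year maximum, assuming the same Batting.csv column positions as in the earlier steps. GROUP ... ALL places every row in a single group, so MAX returns one value for the whole dataset.

```pig
-- Assumed positions: $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
-- A single group that contains every batting row.
C  = GROUP B ALL;
-- One output row holding the largest runs value in the dataset.
all_max = FOREACH C GENERATE MAX(B.runs) AS max_runs;
DUMP all_max;
```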
Pig Script to Locate the Player Name and Year of the All-Year Max Run
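To find who holds that maximum and in which year, one option is to join the single-row maximum back to the batting rows and then to the name columns. This is a sketch; the column positions for both files and all alias names are assumptions.

```pig
-- Batting.csv, assumed positions: $0 = player ID, $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (chararray)$0 AS player_id, (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
C  = GROUP B ALL;
all_max = FOREACH C GENERATE MAX(B.runs) AS max_runs;
-- Keep the batting rows whose runs equal the overall maximum (ties are kept).
best = JOIN B BY runs, all_max BY max_runs;
best_rows = FOREACH best GENERATE B::player_id AS player_id,
                                  B::year AS year, B::runs AS runs;
-- Master.csv, assumed positions: $1 = player ID, $16 = first name, $17 = last name.
M = LOAD 'Master.csv' USING PigStorage(',');
names = FOREACH M GENERATE (chararray)$1 AS player_id,
                           (chararray)$16 AS first_name, (chararray)$17 AS last_name;
named = JOIN best_rows BY player_id, names BY player_id;
answer = FOREACH named GENERATE names::first_name AS first_name,
                                names::last_name  AS last_name,
                                best_rows::year   AS year,
                                best_rows::runs   AS runs;
DUMP answer;
```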
The 4th Goal
- What is the average run (rounded to two decimal places) of each year?
The Pig Script to Find Year Average Run
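A sketch of the yearly average, assuming the same Batting.csv column positions. Rounding to two decimal places is done here by scaling, applying ROUND, and scaling back, which works across Pig versions; releases from 0.13 onward also provide ROUND_TO.

```pig
-- Assumed positions: $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
C  = GROUP B BY year;
-- AVG skips nulls; multiply by 100, round to a whole number, divide by 100.
avg_by_year = FOREACH C GENERATE group AS year,
                                 (ROUND(AVG(B.runs) * 100.0) / 100.0) AS avg_runs;
avg_sorted = ORDER avg_by_year BY year;
DUMP avg_sorted;
```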
The 5th Goal
- What is the average run (rounded to two decimal places) of all the years in the dataset?
The Script to Calculate All-Year Average Run
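A sketch of the all-year average, assuming the same column positions. It averages over every batting row in the dataset; if the goal is instead the mean of the yearly averages, group by year first and then average those values.

```pig
-- Assumed positions: $1 = year, $8 = runs.
A  = LOAD 'Batting.csv' USING PigStorage(',');
A1 = FOREACH A GENERATE (int)$1 AS year, (int)$8 AS runs;
B  = FILTER A1 BY year IS NOT NULL;   -- drops the header row
-- A single group that contains every batting row.
C  = GROUP B ALL;
-- Round the overall mean to two decimal places by scaling and using ROUND.
all_avg = FOREACH C GENERATE (ROUND(AVG(B.runs) * 100.0) / 100.0) AS avg_runs;
DUMP all_avg;
```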